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Abstract 


This article was motivated by the desire to improve Markov chain Monte Carlo 
methods for spatial survival models in which the locations of individuals in space are 
known. For a dataset comprising information on n individuals, standard methods 
of MCMC-based inference involve computing the inverse of an n x n matrix at each 
iteration. However with a judicious choice of auxiliary variables on a regular grid with 
m prediction points it will be shown how to fit an essentially equivalent model but with 
a substantially reduced computational cost. For a fixed output grid, the computational 
cost of the new method is reduced from 0{rfi) to 0{n); the cost of increasing the output 
grid size being O(mlogm). Furthermore, the new method simultaneously solves the 
problem of spatial prediction of functions of the latent field, which for standard methods 
usually presents a further computational challenge. We apply the new method to a 


spatial survival dataset previously analysed in Henderson et al. (2002) and show how 


the new method can be applied to spatial and spatiotemporal geostatistical datasets 
with the same computational benefits. 
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1 Introduction 


This article was motivated by the following problem in spatial survival analysis. Suppose 
we have a spatial survival dataset with n observations in which the times of censored or 
uncensored events have been recorded along with their location in space. The statistical 
model to be htted to these data includes: covariate effects, (3, through a linear-predictor 
X/3, where X is a design matrix; a specified functional form for the baseline hazard, ho{t]U}) 
(where t is time and oj are parameters of hg); and a set of spatially-continuous and correlated 
Gaussian frailties, Y. Suppose it is desired to fit a parametric proportional hazards model 
to these data (noting at the outset that the ideas discussed here are applicable in other 
important situations, see Section]^; the model for the baseline hazard is, 

= exp{X/3 -h F}ho(t;a;), 


where V’ = (/3, w). The form of the hazard function determines both the density and survival 
functions (see below) and in turn these quantities feed into the likelihood given in Equation 
With an appropriate choice for the priors, it is therefore possible to use Markov chain 
Monte Carlo (MCMC) methods (Metropolis et ah, 1953 Hastings, 1970t Gilks et ah, 1995 


Gamerman and Lopes, 2006) to draw samples from the posterior density and hence perform 


Bayesian inference for this class of models. The problem is that the cost of computing the 
likelihood is 0{iY), which makes this method impractical for use with even moderately sized 
datasets without a lot of patience on the side of the user. The question this article addresses 
is: how can we make Bayesian inference via MCMC for this class of models tolerable? 


The solution to the 0{n^) problem proposed in this article was inspired by considering the 
purpose of the spatially correlated frailties. The purpose of the frailty term in non-spatial 
survival models is to capture any unexplained variation in the outcome of interest, that is 
variation that cannot be explained by the available covariates. Spatially correlated frailties 
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are used when it is thought that the hazard of individuals close together in space (condi¬ 
tional on their individual covariates) is similar and we wish to take this into account when 
computing the likelihood. But if we suspect individuals close-by in space to be similar in 
some sense, do we need individual-level spatially correlated frailties? The solution proposed 
in this article is to replace the spatially correlated frailties in Equation ([^ by a (sometimes) 
much larger set of auxiliary Es in such a way as to actually reduce computational cost. 
This is achieved by assuming that individuals who are very close together share the spatially 
correlated component of their frailties, but are allowed to retain individual non-spatial frail¬ 
ties. The sharing of the auxiliary frailties in this way is similar to specifying a hierarchical 


model for the data, but they are introduced in such a way so that Fourier methods (Wood 


and Chan, 1994) can be used for matrix operations, rendering the resulting computations 


0{m logm) where m is the number of auxiliary frailties - but with with the added advantage 
that computation of the posterior for hxed m increases as 0{n). 

The main contribution of this article is therefore a method that makes it possible to £t 
spatial survival models to much larger datasets than would normally be practical and with 
the additional beneht of providing a solution to the problem of spatial prediction (that 
is predicting the latent spatial held Y at locations where we do not have data), which 


otherwise is a further computationally intensive step in the inferential process, see Section 2.2 


Furthermore, the main idea is easily extensible to spatial and spatiotemporal geostatistical 
datasets as discussed in Section The new method requires the frailties to be defined on a 
regular grid and also that F is a stationary-Gaussian process. 


It is increasingly common for survival datasets to include the spatial (and/or temporal) 
coordinates of observations. Example application areas of spatial survival methods in the 


literature have included: infant mortality in Minnesota (Banerjee et ah, 2003), smoking 


cessation in cure-rate/time-to-relapse models (Banerjee and Carlin, 2004), air pollution and 


mortality in Los Angeles (Jerrett et al., 2005), American cancer data at the county level 
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(Diva et al., 2008), the timing of U.S. House members’ position announcements on the 


North American Free Trade Agreement Darmofal (2009) and cancer mortality of Hiroshima 


atomic bomb survivors (Tonda et ah, 2012). 


The complex nature of spatial survival modelling has historically meant there have been 
more papers using Bayesian methods than Frequentist methods. Exceptions to this rule 


include the article by Paik and Ying (2012), who introduce a composite likelihood method for 


inference and the article by Li and Lin (2006), who employ spatial semiparametric estimating 


equations. Others exceptions include articles developing scan statistics designed for spatial 


survival data such as Huang et al. (2007) and Bhatt and Tiwari (2014). 


In the present article we continue the Bayesian tradition and although the ensuing discussion 
will neither concentrate on semiparametric proportional hazards nor on proportional odds 


models; we note that the main idea is easily extensible in these directions (Li and Ryan 


2002 Banerjee and De^ 2005; Diva et ah, 2008). The present article concentrates on purely 
spatially referenced data, rather than on spatio-temporally referenced data as discussed by 


Banerjee and Carlin (2003); we note that the extension to this model class is also possible. 


As in Li and Ryan (2002), in the present article we will use log-Gaussian frailties and our 


discussion will focus on the spatial proportional hazards model as was investigated by Hen- 
derson et al. (2002), Banerjee et al. (2003) and Diva et ^ (2008). Another interesting, but 


at best tentatively related work, is Zhao and Hanson (2011) who induce spatial correlation 


through a prior based on a mixture of spatially dependent Polya trees - the method proposed 
in the current article is tentatively related as it will make use of a judiciously chosen prior 
on the random effects for spatial prediction. 

Whilst the issue of handling the computational burden of large spatial survival datasets has 


received little attention in the literature, one example being Hennerfeind et al. (2006), see 


below: there has been some progress in the classical geostatistical literature. In the context 


of modelling spatiotemporal dependence in ocean temperatures, Higdon (1998) proposes a 
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process convolution approach to modelling the random effects. Fuentes (2007) avoids the 


expensive matrix computations involved by adopting an approximate likelihood approach in 
the spectral domain, a concern that has been raised about this and similar methods is the 


quality of the approximation (Banerjee et ah, 2008). Banerjee et ah (2008) use predictive 


process models: they place knots at a selection of observation locations and also possibly 
some non-observation locations, replacing individual spatial random effects with interpo¬ 
lated values from the process on the lower-dimensional knots. Whilst they seek to use an 
m smaller than the original n to ease matrix computations, the approach in the present 
article does not adopt that strategy - for example in Section we sample 16384 random 
effects rather than 1043, but computation time is over 5 times faster. A further method 
for approximating spatial processes that has received some attention in the literature is by 


using low/fixed-rank approximations to the covariance matrix as in Cressie and Johannesson 


(2008), Wikle (2010) and Rodrigues and Diggle (2012). Whilst low/fixed rank strategies are 
computationally effective they can be difficult to set up including choosing an appropriate 
rank and specifying priors; spatial dependence from such models is also more difficult to 
interpret. The technique of circulant embedding used in the present article has also been 
used to aid matrix computations for spatial processes in various other contexts including 


inference for log-Gaussian Cox processes: Mpller et al. (1998), Brix and Diggle (2001) and 


Taylor et al. (2013); and for Gaussian processes on a lattice with missing values: Stroud 


et al. (2014). 


Wikle (2002), Royle and Wikle (2005) and Paciorek ( 2007[ ) in the context of geostatistical 
analysis suggest a similar strategy to the method proposed in the present article, although 
their random effects are interpolated from a spectral representation of the process; in con¬ 
trast, our approach is to handle the random effects directly using the 2-dimensional discrete 
Fourier transform as a tool for matrix computations. On a second point, these authors did 
not consider the issues that arise from wrap-around effects, which in the present article we 
handle by at least doubling the observation bounding box in each direction, as in Figure 
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Park and Liang (2012) and Ganggang et al. (2012) also use a similar strategy to that pre¬ 


sented in the present article using the example of a simple Gaussian geostatistical model; 
they advocate Gaussian Markov random helds (GMRF) to dehne the covariance structure 
on their computational grid. Whilst for very small neighbourhood sizes their methods are 
efficient on this grid, for larger neighbourhood sizes, which are required for approximating 


Gaussian helds with larger spatial dependence (see for example Taylor and Diggle (2014)), 
they suggest Fourier methods to perform matrix computations. For these models with larger 
neighbourhood sizes, in which they resort to Fourier methods, the restriction to GMRF- 


induced covariance structures is unnecessary as will be shown in the present article. Park 


and Liang (2012) and Ganggang et al. (2012) do not propose a concrete solution to deal with 


wrap around effects. 


Of the methods that address the problem of inference for large non-Gaussian datasets (specif¬ 


ically spatial survival datasets), Hennerfeind et al. (2006) propose to use the reduced-rank 


methodology developed by|Kammann and Wand (2003); their methods are implemented i 


m 


the BayesX package (Brezger et ah, 2005). As pointed out by Park and Liang (2012), whilst 


such methods (utilising a low-dimensional approximation to the latent Gaussian process) are 
effective at reducing computation time, for large datasets the dimension of the approxima¬ 
tion can still be very high, meaning these methods will not work well. As mentioned above 
there are other issues for low-rank methods including choosing the rank, specifying priors 
and interpreting spatial dependence. 


The present article can therefore be seen as a generalisation of Park and Liang (2012) and 


Ganggang et al. (2012) to spatial survival and other non-Gaussian spatial and spatiotemporal 


geostatistical models in which the (stationary) covariance function is able to adopt any 
permissible form (e.g. exponential, Matern etc.) on the grid: not merely GMRF form, 
which is a special case. As well as being computationally efficient, our proposed method also 
has the advantage of interpretability with respect to the parameters of the latent Gaussian 
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process. 


As stated above, for the purpose of this article the methodological focus will be spatial 
survival modelling, we refer the reader to Section in which we show how the method can 
be extended to more general spatial and spatiotemporal geostatistical datasets. To that 
end, we begin in Section with a brief introduction to spatial proportional hazards survival 
models and spatial prediction. In Section we introduce the new model including details on 
computations using Fourier methods, inference via MCMC and re-visit spatial prediction. 
In Section]^ we apply the method to the leukaemia dataset of Henderson et ah (2002). The 
article concludes with a discussion in Section |6l 


2 Spatial Proportional Hazards Survival Models 


In this section we give a brief overview of spatial proportional hazards survival models and 
discuss two methods for spatial prediction. 

Survival analysis (‘event-history analysis’, ‘duration analysis’ or ‘reliability analysis’ as it is 
referred to in other disciplines) is concerned with the modelling and prediction of time-to- 
event data: for each individual, we observe the time of an event. We do not observe the 
exact survival time of all individuals, rather we might only know that an individual was 
alive the last time they were seen by a doctor, for example; such events are termed ‘(right) 
censored’. The concept of censoring is what makes survival analysis statistically interesting. 
There are three types of censoring: right censoring is where the event happened after some 
known time, left censoring is where the event happened sometime before a known time and 
interval censoring is where the event happened during a known interval of time. We refer 


the reader to Cox and Oakes (1984), Klein and Moeschberger (2003) or Klein et ah (2013) 


for further details on the preliminaries of survival analysis. 
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2.1 Survival Models 


Let the occurrence time of an event of interest, T, be a random variable. There are three main 
quantities of interest in survival analysis: the density function, f{t), giving the probability 
density function of T; the hazard function. 


h{t) 


, (F(t<T <t +At) 
At^o \ At X S{t) 




giving the instantaneous failure rate at time t conditional on survival up to time f; and the 
survival function, S{t), giving the probability of survival after time t. These three quantities 
are related as follows: 

S{t) = 1 - f{x)dx = exp{-H{t)} and h(t) = (1) 

Other quantities of interest are the cumulative hazard, H{t) = h{s)ds and the lifetime 
distribution function, F(t) = 1 — S(t). 

A (parametric) proportional hazards spatial survival model is completely dehned by intro¬ 
ducing a model for the hazard function, e.g.. 


h{ti] %Ij, Yi) = exp{Xj/3 -f Yi}ho{ti; u), (2) 

Here a subscript i denotes belonging to individual i, Xi is a vector of covariates, (3 are the 
covariate effects, u are parameters of the baseline hazard function (which has a tractable 
form, see below) and in this article T) is the value of a spatially continuous, stationary 
latent Gaussian held Y at the location of observation i. The parameters of the covariance 
function of the latent held Y will be denoted rj. We will assume that the latent held has 
been parametrised in such a way that E[exp(y)] = 1, which is easily achieved by setting 
the mean of Y to be —a‘^/2, where is the marginal variance of the held. Examples 



of suitable spatial covariance functions useful in epidemiological applications include the 
Matern and Exponential models. We require ho to be ‘tractable’ in the sense that we must 
be able to evaluate it given parameters cn, and further we must also be able to evaluate 
the baseline cumulative hazard, Hq = h{s)ds. Examples of suitable functional forms 
for the baseline hazard include the exponential, Weibull, Gompertz, Gompertz-Makeham, 
log-normal, gamma and F models. 

Let = {f3, u, rj, EJ). We can derive the density function for a parametric PH spatial survival 
model using the right-hand expression in Equation 

=exp{Xi/3+ Yi}ho{t-,u;)exp{-exp{Xi/3+ Yi}Ho{t;u)}; (3) 

The corresponding lifetime distribution is, 

F{t-,^i) = 1 - exp {-exp{Xi/3 -F Yi}Ho{t-,uj)} . (4) 

Let ti be the data from the ith individual, which in the case of an interval censored obser¬ 
vation is represented as a vector, ti = Gonditional on the latent held, Yi,..., W, 

we will assume that observations are independent. The likelihood in this case splits into 
contributing components from left, right and interval censored data as well as uncensored 
data: 

a'. 

i uncensored i left censored i right censored i interval censored 

(5) 
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2.2 Spatial Prediction 


Spatial prediction, the process by we predict properties of the latent field at locations where 
we do not have any data, can be achieved either in-line with the MCMC scheme or post- 
MCMC. 

For the post-MCMC approach, having fitted a spatial snrvival model, snppose now that 
we wish to predict the latent field Y at some additional locations Fi,... this can be 
achieved nsing the following integral, 

7r(Fi:m|data) = / 7r(y'i;m|Fh:n, data)7r(Yi;„, r7|data)dYi:nd77. 


Assnming that Yi-^m is conditionally independent of the data given Yi:„, we can obtain an 
unbiased estimate of the above as 


1 A 


7r(yi:m|data) 7r(Yi:m|F/*2, 


n 


i=l 


( 6 ) 


where for example is the ith sample of the parameter 7] from the MCMC chain. Since 
7r(yi:m, is jointly Gaussian by assumption, the conditional 'n'{Yi,m\Y}^l,r]^^^) is avail¬ 


able in exact form, see for example Rue and Held (2005); the Monte Carlo approximation in 


Equation ([^ can thus be seen as a mixture of Gaussians. Unbiased estimates of quantiles of 
the marginals can be obtained using 


x(f}|data)df} « i ^ / a(l' |r«,,,l‘))df}, 


2 = 1 


there being established numerical methods for evaluating the contributing integrals on the 
right hand side. Whilst the above computations are tractable, in general they are computa¬ 
tionally expensive. The cost of computing each conditional distribution, 7r(Yi:m|u/:n,is 
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0{n^). 


The in-line method, described by Diggle et ah (1998) for example, is more straightforward 


in principle and firstly involves running the MCMC chain to sample from 7r(y, /3, w, T^jdata) 
until convergence. At this point the parameter vector is augmented with ys and a Gibbs 
scheme is used to sample from 7r(y, F, oj, r^jdata) using the decomposition, 

7r(y, y, (3, oj, r^ldata) = 7r(y |y, f3,uj, rj, data)7r(y, /3,a;, r^jdata). 


noting that under this model, conditional independence properties imply 7r(y, (3, u, r/ldata) = 
7r(y, /3, a;, r^iy, data). The MCMC sampler already in use can continue to be used to draw 
samples from the density 7r(y,/d, a;, r^ldata) and at each ‘retain’ iteration, sampling from 
7r(y |y, /3, w, T], data) is a straightforward application of the conditional distribution of a mul¬ 
tivariate Gaussian, see (Diggle et ah, 1998, pp. 309) or Rue and Held] (2005). For increasing 
numbers of observations, the dominating cost of this method is 0{n^). 


3 Auxiliary Variables for Spatial Survival Models 


In this section we show how by making a slight modification to model (|^, we are able to 
make huge computational gains, both in terms of analysing our data and for the purpose of 
spatial prediction. The main idea is to replace the individual spatially correlated frailties 
by a collection of auxiliary frailties (and possibly some additional non-spatial frailties) so 
that although the number of parameters in our model typically is much larger, the compu¬ 
tational complexity is massively reduced. We describe how these computational gains are 
achieved using Fourier methods, discuss a generic gradient-based adaptive MCMC algorithm 
for efficient inference and re-visit spatial prediction for the new method. 


The most common use of auxiliary variable methods (Edwards and Sokal, 1988 Besag and 
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Green, 1993; Higdon, 1998 Damien et al., 1999 Pitt and Shephard, 2001) in Bayesian sam¬ 


pling schemes such as MCMC is in the situation where it is desired to sample from the 
density of a random variable 6 conditional on some data, 7r(0|data), but direct sampling 
from this density is difficult. If it is possible to introduce an additional set of variables '& in 
such a way that sampling from 7i{6, 'd|data) is easier, then are known as auxiliary variables. 
Auxiliary variables are a very important technique in Bayesian sampling schemes and have 


been applied in many areas including binary/multinomial regression (Holmes et ah, 2006), 


capture/recapture modelling (Pollock, 2002) survival analysis/multiple imputation (Faucett 
et ah, 2002), stationary time series models (Pitt and Walker] 2005), variable selection (Nott 


and Green, 2004), mixture sampling (Friihwirth-Schnatter and Friihwirth, 2007) and sam¬ 


pling jump processes (Rao and Teh, 2013) among many others. In the present article we 


refer to auxiliary variables in the same spirit as Pitt and Shephard (2001), in that they are 


additional variables ‘present to aid the task of simulation’. 


3.1 The Model 


Our frailties will be dehned as piecewise constant on a fine, regular 2™'i x 2™'^ (where mi, m 2 G 
Z>i) grid of cells, Q, that cover an extended version of the bounding box of all observations 


in space (see Section 3.2 for further details). The value of the stationary Gaussian field Y 
at the centroid of each of these cells will be denoted Fi,... where m = 


The basic spatial survival model we will use is very similar to that in Equation 


h(ti;'ip,Ygiii) = exp{Xi/? Yg[i\}ho(ti-,uj), 


(7) 


where all terms are as before, but where Q[i\ G {1,..., m} denotes the index of the grid cell 
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to which observation i belongs. A further extension of this model might allow 


h{ti]'il),Yg[i^,Ui) = exp{Xi/3 + yg[i] + Ui}ho{ti]Uj), 


( 8 ) 


where the Ui ~ N(0,(T^) are individual-specihc non-spatial frailties. The idea of including 
both spatial and non-spatial random effects is common in the geostatistics literature, see 
for example Diggle et ah (2002) and Diggle and Ribeiro Jr. (j2007), and was used in spatial 


survival models in Darmofal (2009). 


During the MCMC scheme, we will not sample the YiS directly, rather we work with a vector 
of transformed variables, T = (Ti,..., T^)^, such that 






= —a 


72 + S72 


^ Ti ^ 


\^m J 


(9) 


1 /2 

where S7 is the Cholesky decomposition of the covariance matrix, S,,, obtained by evaluat¬ 
ing the covariance function at each of the centroids of the grid cells in Q] note the dependence 
on T]. This is really the key ingredient to the new method, since there is a very sensible prior 
for T, namely the multivariate N(0,1) prior. The reason this is such a good prior is that 
if we wanted to simulate a multivariate Gaussian variable with mean —a‘^j‘2 and variance 
matrix then this can be achieved precisely by hrst simulating T ~ N(0,1) and then using 
Equation (|^ to instantiate an appropriate {hi,..., TA}- 


3.2 Computation Using Fourier Methods 

In this section we discuss the use of Fourier methods and how these can lead to a substan¬ 
tial computational saving despite a potentially large increase in the dimensionality of the 
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problem. 


In Section |3.1| it was mentioned that the bounding box is extended, the sense in which this 
is achieved is illustrated in Figure In this diagram, the polygon (a triskaidecagon) in 
the bottom left represents what will be referred to as the ‘observation window’, inside of 
which all of the observations are located; technically, this observation window is not actually 
required: the purpose of it is to illustrate a bounded region on the plane within which the 
data are contained (note the choice of this polygon is arbitrary and the region neither needs 
to be convex nor even connected). Surrounding the observation window, there is a dark 
bounding box containing the observation window (and by extension, the observations) and a 
lighter extended bounding box, containing the bounding box. The location of the bounding 
box within the extended bounding box is arbitrary: it could equally well have been in the 
centre, for example. 


The process Y will be represented on a hne regular grid Fi,... Y^, partitioning the extended 
bounding box into m = rectangles, on which Y will be treated as piecewise constant. 

This ‘extended grid’ is notionally wrapped on a torus and we use a toroidal distance metric, 
again illustrated in Figure to compute the distance between the centroids of any two rect¬ 
angular cells on the extended grid. The toroidal metric is the minimum distance between two 
points, either travelling directly or around the minor and/or major radii. A precise dehnition 


is given in Mpller et ah (1998). We use the toroidal distances in specifying the covariance 


matrix, on the extended grid. The restriction to 2™^ x 2 '"^ grids is the optimal choice for 
computational purposes, but this is flexible. For example, there exist algorithms for the con¬ 
struction of computational ‘plans’ for fast implementation of the discrete Fourier transform 


(DFT) on other grid sizes, see the FFTW library (Frigo and Johnson, 2011| and R wrapper 


library ( |Krey et ah 2011). In working on the extended grid, rather than on a grid over the 
bounding box, we effectively eliminate wrap-around effects in the observation window since 
the distance between any two points inside it is the same as the planar Euclidean distance 
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- for such pairs of points it is always shorter to travel directly and not around the torus. In 
other words the covariance between two points in the observation window, computed using 
the toroidal distance metric, will be correct. 



Figure 1: Diagram illustrating how the observation window (the triskaidecagon in the 
bottom left corner) is initially embedded in a bounding box (the dark square) and then 
inside the extended bounding box (the light square). The solid, dashed and dotted lines 
show three different routes on the notional torus between two points in the observation 
window; the shortest of these routes is used in computing the covariance between any two 
points in the extended bounding box. 


A full discussion of how the DFT is used in matrix computations on block circulant matrices 


is given in Chapter 2 of Rue and Held (2005) or Taylor and Diggle (2014), but very briefly 


the idea is to use the spectral decomposition of into the product 


= FEF 


H 


where the superscript H denotes the conjugate transpose. Matrix-vector products such as Fv 
and F^v for some vector v are available as a suitably normalised discrete Fourier transform 
(or respectively inverse discrete Fourier transform) of v. 
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On the extended grid, the matrix will have block circulant structure, and it is this 
property that allows us to make use of the DFT for computations such has computing the 
matrix square root, or inversion. In brief, all of the information about the second order 
structure of the process Y on the extended grid is contained in a 2^^ x 2”^^ matrix called 
a ‘base matrix’, say. As to why Fourier methods are so useful: for example, the two- 
dimensional DFT of gives a second matrix of the same size containing the eigenvalues of 
computed in 0(m logm) time. These base matrices are not to be manipulated in the 
ordinary sense: they represent a massive condensing of information, since on a 2™'^ x 2”^^ 
grid the full covariance matrix would have 

The main disadvantage of Fourier methods is that large values of the spatial decay param¬ 
eter, compared with the size of the observation window can lead to a non-positive dehnite 
covariance matrix for the frailties. One solution to this issue is to extend the grid further, 
for example by tripling or quadrupling the bounding box in each direction, with an associ¬ 
ated increase in computational cost. However for the usual doubling of the bounding box 
in each direction, the occurrence of non-positive definite matrices in an MCMC run can be 
prevented by using an informative prior for the spatial decay parameter that does not place 
much weight outside of roughly 1/5 of the width of the observation window. It is the opinion 
of the present author that this is only a very minor concern, for to obtain reliable estimates 
of the spatial decay parameter, we require replication of the latent process Y and therefore 
an observation window large enough to permit this to occur. 

3.3 Inference Via MCMC 

In this section we discuss MCMC for model ([^. Our Bayesian model for spatial survival 
data includes the chosen hazard form, the likelihood from Equation and a set of priors for 
the model parameters. We do not in fact typically sample u and r] directly, rather because 
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these parameters often only have support on the positive real line, we perform MCMC on 
a transformed scale, for example appropriate transforms might be a) = logo; and fj = logp. 
By Bayes’ Theorem, the product of the prior and likelihood is proportional to the posterior: 


7r(/d, a), 7], r|data) 


7r(data|/d, oj,r],T)7i{(3,u;,ri,T) 
7r(data) 


oc 7r(data|/3, cn, rj, uj, rj, T); 


the quantity 7r(data) is the marginal likelihood. Note that the conditional independence 
properties of this model imply that 7r(data|/3, a), p, T) = 7r(data|/3, a), T). 


We suggest using Markov chain Monte Carlo methods (Gilks et ah, 1995 Gamerman and 


Lopes, 2006) to draw from the posterior, specihcally advanced MCMC schemes that make 


use of gradient information to inform the proposal kernel. The scheme that we consider 
here is a mix of adaptive random walk and Metropolis-adjusted Langevin kernels: Langevin 
kernels for (3, u and T and a random walk kernel for f). The reasons we advocate a random 
walk kernel for fj are (i) because the gradient with respect to this parameter can be difficult 
to evaluate for general covariance functions and there is a non-negligible cost associated with 
computing it, which would be required for a Langevin proposal; and (ii) the parameters fj 
(in particular the spatial decay parameter) are typically not very well identihed by the data 


in any case (Zhang, 2004) - so it is not clear the extra effort of computing the gradient with 
respct to fj is worthwhile. There are of course other choices for proposal kernel for example 


Hamiltonian schemes (Neal, 2011 Girolami and Calderhead, 2011), which should be more 
efficient if tuned correctly, but one main advantage of random walk and Langevin kernels is 
that there are theoretical results ( [Roberts and Rosenthal 2001) that can be used to guide 
the tuning process, which otherwise can be both difficult and tiresome. 


The MCMC method we use is an example of a Metropolis-Hastings sampling scheme in which 
having initialised the chain at, ,a;W,r)W , the Rh step of the algorithm involves 

drawing a candidate p*, T*} from a proposal density, q, and accepting it, i.e., setting 
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with probability 


min < 1 


7r(/3*, cD*, fj*, r*|data) u\ fj*, T*) 

’ 7r(/3h-i)^ a;(*-i)^ rh-i)|data) q{l3*,oj*, fj*, r*|/3h-i)^ rh-b) 


Let C = {(3,u,7],T). We suggest the following proposal: 




( 10 ) 


where 


/ (q rpP-l) I aiog{7r(C(* ^W)} \ 




V 

r(*-i) + '^Er 


(i-i) 


aiog{7r(c(^-i)|y)} 

dr 


and E = 


0 

0 


chplf, 


\ 


h^Er y 


( 11 ) 


In Equation (11), E^ (;;j is an approximation to the negative inverse of the observed information 


matrix, {—E[X(/3,a;)]}“^ conditional on an estimated L and fj, and similarly for E^j and 
Ep. Due to the size of the latter, we advocate using a diagonal matrix rather than the 
full covariance matrix. The constants h? and are approximately optimal scalings 


for Gaussian targets explored by Gaussian random walk or MALA proposals, see Roberts 


and Rosenthal 


(2001). We set h‘i^ = 1.65^/[dim(/d) + dim(ci;)]^/^, h? = 2.38^/dim(r 7 ) and 


hp = 1.65^/ dim(r)^/^, where dim is the dimension. The constant c is present because we 
wish to adapt the value of h in our chain using a method of|Andrieu and Thoms (2008) (see 


Taylor et ah (2013) for details) to achieve an approximately optimal acceptance rate of 0.574 
to suit the Langevin kernels, which is too high for the random walk; we set c = 0.4 roughly 
the ratio of the approximate optimal acceptance rates for random walk and Langevin kernels 
(0.234/0.574). 
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We can use a combination of maximum likelihood estimation and ad hoc methods to obtain 


initial estimates of /?, a), fj and F. Initial estimates of the parameters [d and Cj can be obtained 
via maximum likelihood, ignoring any spatial correlation between observations. Having 
obtained initial estimates of jd and cD, for each individual observation we can compute a value 
of Y which would maximise the individual contribution to the log-likelihood (conditional on 
fd and a;), taking the cell-wise mean to get an initial ad hoc value of Y for each cell with 
data. In those cells for which there are no associated data the initial Y can be set to the 
overall mean of the Ys from cells with data. Lastly, with the estimates thus obtained for 
jd, Co and Y (and hence F, given some candidate values of fj), we can construct a quadratic 
approximation to the conditional posterior 7r(r7|/9, a), F, data) on a grid spread out over the 
range of the prior for i); this approximation can be maximised exactly to get an initial value 
for fj, and can be derived exactly from the curvature of the quadratic surface at the 
maximum. 

Conditional on the initial guess at (d, Co, fj and F, the matrices and Sr can be obtained 
from the second derivative of the posterior with respect to those parameters. Although we 
have an initial guess at F from the above, we start the MCMC algorithm with F = 0, which 
works well. 


An open-source piece of software is available for htting this model using the above method 


in the form of the R package spatsurv, see Taylor and Rowlingson (2014). 


3.4 Spatial Prediction 

Having htted this model using MCMC, we will have a sample {y}^\ ..., Ym^}^^, where N 
is the number of retained iterations, which can be used for spatial prediction directly. The 
contribution to the likelihood of grid cells without observations is slightly subtle, whilst they 
obviously contribute to the posterior, their joint posterior probability density 7r(Yi:m|data), 
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is partly mediated through the prior 7r(r) and partly through their involvement in the trans¬ 
formation in Equation (|^. The upshot is that we obtain unbiased joint Bayesian inference 
for all cells on the grid. 


4 Simulation Study: How Much Faster is the New 
Method? 

In this section we present some simulation results to give the reader an idea of the com¬ 
parative computational burden of the proposed method using Fourier methods for matrix 
operations with the standard method using ordinary matrix operations. The discussion on 
complexity here refers to the cost of implementing one iteration of the MCMC sampler. All 
computations in this article were performed on a 3.40GHz Intel(R)Core™i7-4770 desktop 
computer. 

As mentioned briefly earlier in this article, the standard method using ordinary matrix 
operations has complexity O(n^), where n is the number of iterations. For the proposed 
Fourier methods the complexity is a little more subtle: at each iteration the matrix operations 
(such as inversion and computation of the square root) will have a cost scaling with mlogm 
but evaluation of the remainder of the posterior scales as 0{n). Thus for a hxed grid size, 
the cost of adding additional observations scales linearly with the number of observations. 

We used MCMC to £t model ([^ on datasets containing 50, 100, 200, 300, 500, 1000 and 
2000 observations and recorded the time in seconds the MCMC algorithm taks to run for 
1000 iterations; the time recorded does not include pre-processing. We did not run a similar 
analysis for Model ([^ as it has the same computational complexity as model ([^, so the 
plots would look similar. This simulation study only considers how long it takes to generate 
a sample of size 1000: for the standard model, if it is desired to perform spatial prediction. 
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there is an additional cost on top of this - the ontpnt of the new model inclndes the joint 
posterior of all Fs on the extended grid (thongh only the Y s covering the observation window 
are likely to be of relevance). 



Fignre 2: Plot showing the time taken to run 1000 iterations of the MCMC algorithm 
using: standard matrix operations (solid line) and Fourier methods on a 32 x 32 grid (short 
dashed), 64 x 64 grid(dotted), 128 x 128 grid (dot-dash) and 256 x 256 grid (long dash). 
Note for example that the output 256 x 256 grid actually runs on a 512 x 512 extended grid 
(i.e. is predicting over 1/4 million variables); each of the other grids is extended by 2 in each 
direction. 


Figure shows the results from this experiment; the plot conhrms the rate of increase in 
computational cost for both methods. For small datasets, the new method is slower than the 
standard method, however, by the time there are 500 observations in the dataset, depending 
on the size of the output grid, the new method is faster: this is the case for output grids of 
size 32 X 32 and 64 x 64, but not for 128 x 128 or 256 x 265. By the time the dataset is of 
size 1000, the new method is at least twice as fast for grid sizes up to 128 x 128; for datasets 
of size 2000, the new method is 21.2 times faster for a 32 x 32 grid, 9.9 times faster for a 
64 X 64 grid, 6.6 times faster for a 128 x 128 grid and 3.4 times faster for a 256 x 256 grid. 
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5 Application: Spatial Survival Analysis of Leukaemia 


Data 


In this section, we analyse the leukaemia data from Henderson et ah (2002) to demonstrate 


that the new method is both practical and gives comparable estimates to an established 
method; the reader should refer to this article for a full description of these data. The 
leukaemia data are illustrated in Figure which shows the 1043 observations plotted over 
the north west and south east areas of Lancashire in the United Kingdom. These data 
were initially provided on the unit square and for the purpose of interpretability of the 
spatial decay parameter, were approximately transformed onto the Ordnance Survey of Great 
Britain projection, EPSG 27700. 

Alongside the 1043 observed survival times and (right) censoring indicator, the dataset in¬ 
cludes covariate information on each subject’s age, sex, white blood cell count (WBG) and 


a measure of deprivation (the Townsend index). In their original article, Henderson et ah 


(2002) used gamma frailties, the Breslow estimator (Breslow, 1974) for the baseline hazard 


and the Matern covariance function with smoothness parameter 1 for the covariance function 
of the spatially correlated frailties. In comparison, in the present article we use model ([^ 
with log-Gaussian frailties and a Weibull parametric model for the baseline hazard. The 
Weibull baseline hazard takes the form. 


ho(t] cr. A) — cvXt 


a—1. 


( 12 ) 


this choice gives some flexibility regarding the shape of Hq and allows us to produce posterior 
estimates of the baseline hazard and individual hazard/survival/density curves with credible 


intervals. For even greater flexibility in modelling ho it is possible to use H-splines De Boor 


(1978), which are also implemented in the spatsurv package, whilst the increased flexibility 
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Figure 3: Plot of the leukaemia data, spanning north west and south east Lancashire in 
the United Kingdom. The red dots represent uncensored observations, the black crosses are 
right-censored observations; the size of the plotting character is proportional to the observed 
time. 
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offered by these models can sometimes be desirable they can also lead to over-htting. 
We assnmed an exponential covariance fnnction for Y, i.e. 


cov{Yi,Yj) = exp{\\ci - Cj\\/(f)} 


where for example q is the coordinates of the ith observation and || ■ || denotes Euclidean 
distance. The quantity is the marginal variance of the latent held and 0 is the spatial decay 
parameter, larger values meaning more spatial dependence. The choice to use the exponential 
model was arbitrary - we could equally have chosen the Matern model with roughness 


parameter 1 used in Henderson et al. (2002), but since the two models are fundamentally 
different in any case it was thought to be of interest to compare the spatial predictions from 
both models. 


We placed the 1043 observations inside a 64 x 64 grid of square cells of dimension 1650 x 1650 
metres and inference took place on an extended 128 x 128 grid. We used a N(0,10^) prior for 
both 13 and logo;, a N(0, 0.5^) prior for log a and a N(log5000, 0.3^) prior for 0. The MCMC 
sampler was run for 1100000 iterations with a 100000 iteration burn-in and retaining every 
1000th sample, the run took 11.5 hours - see Section]^ for a method of speeding this up on 
a multi-core machine. The parameters jS, logo;, log 77 and T were initialised as described in 
section 13.31 

To check for convergence, we examined a plot of the value of the log-posterior over the 
retained iterations, this showed that the sampler had appeared to have left the transient 
phase and that it had found a mode, see Figure We furthermore examined trace and 
autocorrelation plots of the parameters (3, u = {a, A) and rj and a selection of 20 ran¬ 
domly chosen Es, these can be viewed at http: //www. lancaster. ac.uk/staff/taylorbl/ 
amlmcmc/MCMCoutput .html, For the remaining Fs, we examined a plot of the lag-1 auto¬ 
correlations, these ranged between -0.12 and 0.13 with the 95% quantiles between ±0.07. 
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We also overlaid plots of the prior and posterior to check whether each of the parameters 
was well-identified by the data; this was found to be the case with all but the spatial decay 
parameter 0, as expected - see Figure 

Having provided evidence for convergence of the MCMC algorithm, we can proceed to make 
statistical inferences. Estimates of the the parameters (3, oj = {a, (3) and r] from this model 
are given in Tabl^ These results can be compared with Table 5 in Henderson et ah (2002) 
and it can be seen that these show a similar story with some minor differences in the estimates 
of f3 and their credible intervals, as might be expected from these two different models. The 
other parameters {u and rj) are not directly comparable. 


parameter 

median 

2.5% 

97.5% 

age 

3.38x10-^ 

2.94x10-2 

3.82x10-2 

sex 

6.45x10-2 

-8.29x10-2 

0.194 

WBC 

3.2x10-3 

2.31x10-3 

4.13x10-3 

deprivation 

2.92x10-2 

8.25x10-3 

5.16x10-2 

a 

0.611 

0.578 

0.649 

A 

3.02x10-3 

1.95x10-3 

4.5x10-3 

a 

0.387 

0.266 

0.546 

(j) (metres) 

5316 

2958 

9521 


Table 1: Table showing parameter estimates for the leukaemia analysis. 


Henderson et ah (2002 


It is also of interest to compare a plot of the mean spatially-correlated frailties expjH} with 


that presented in Figure 6 of Henderson et ah (2002), the former are shown in the top-left 
plot of Figure]^ - and comparing this to the original, it can be seen that there is excellent 
corroboration. 


Since we have at our disposal a sample from the joint posterior of all model parameters, 
we are at liberty to construct other summary measures of interest. For example, the plot 
of the mean surface expjH} is in some sense dehcient, as it gives no indication as to how 
precise the estimates of the latent held are at each point. To address this, we can plot 
exceedance probabilities: the posterior probability that the exponential of the latent held 
exceeds some threshold c. The interpretation of exp{y} > c for some threshold c > 1 is as a 
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Figure 4: Top left: predicted exp{y}, to be compared with Figure 6 from Henderson et ah 
(2002). Top right: plot of P[exp{y} > 1.2]. Bottom left: plot showing the posterior median 
baseline hazard and 95% credible interval. Bottom right: plot showing the posterior median 
of the spatial covariance function and 95% credible interval. 
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multiplicative increase in hazard rate at the spatial location of the particular Y concerned. 
The threshold c should really be set by specialist clinicians. The top right plot of Figure 
1^ is a plot of exceedance probabilities derived from the new model, in this case showing 
P[exp{y} > 1.2]. We examined further plots of exceedances over 1.5 and 2 times the hazard 
rate, but the predicted probabilities were low in these cases. Lastly (but not exhaustively), 
we can produce plots of the posterior baseline hazard function and spatial covariance function 
with credible intervals; these are shown in respectively the bottom left and bottom right plots 
in Figure 

The purpose of this example has been to demonstrate that the new model and methods give 
comparable results to a previously published study in the area of spatial survival analysis. 
A direct comparison was not possible as the two models were fundamentally different, but 
the main conclusions from these analyses would be very similar. 


6 Discussion 


In this article, we have shown how Fourier methods can be used to improve computational 
performance in spatial survival analysis and how the proposed method simultaneously solves 
the problem of spatial prediction. 


Most importantly, the new method can be used for Monte Carlo inference with large datasets 
for which it would be prohibitively slow to implement the standard 0{n^) method, thus 
making MCMC tolerable for this class of models. The speedup to 0{n) for fixed output grid 
size represents a substantial gain in computational time and the 0{m\ogm) increase in cost 
for increasing grid-size represents (to the author’s knowledge) a currently optimal complexity 
in this respect. The fact that we have provided open-source software for implementing the 
proposed method (Taylor and Rowlingson] 2014) means that researchers interested in htting 
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the models discussed in this article on their own datasets will be able to do so with minimal 
programming effort. 


One issue not discussed in the main body of this article is that the proposed method may 
not be suitable for learning about very fine scale spatial variation. The limit will in practice 
be governed by the grid size used, which itself may be dictated by the availability of com¬ 
putational resources. If very fine scale inference is required, one option would be to run the 
analysis on a smaller observation window of interest, but there must be sufficiently many 
observations in the smaller window to make this worthwhile. As noted in the discussion of 


Banerjee et ah (2008): “learning about fine scale spatial dependence is always a challenge”. 


This research arose out of the desire to solve a particular problem in survival analysis, but 
the idea is more widely applicable as will now be illustrated. More generally, the proposed 


method can be used for any type of classical geostatistical analyses (Diggle et ah, 1998 


Gelfand et ah [2010 ) in which data have been recorded at point-locations and it is desired to 
(i) make inferences about the effects of individual-level covariates, whilst (ii) adjusting for 
potential spatial-correlation between observations through some unobserved environmental 
exposure and (hi) make predictions of functions of the latent field at points in space where 
we do not have data. This includes semi-parametric and proportional odds spatial survival 
models, which can be thought of as belonging to the class of geostatistical models. As an 
example, consider a Poisson geostatistical model for an observed count Zp 


Poisson(i?j). 

log Ri = -Aj/3 + Yi + Ui, 


where Ij is a spatially correlated term and Ui are non-spatial effects. The second of these 
equations could simply be replaced by 


log Ri — Xij3 -\- Yg[i] + Ui- 










whence we again see a computational speed-up from O(n^) to 0{n) for fixed output grid 
size. 

For spatiotemporal datasets, the idea is very similar. Suppose that {Yt} is a stationary 
spatiotemporal Gaussian process with separable a separable covariance function such that 
we may write, 

Yt = + [1 — + [1 ~ 

where ^ [0,1] and ~ N(0, Sy). Then if Fq ~ N(/i, Sy) this implies that ~ N(/i, Sy) 
for all t. Dehning the relationship between Ft and as Ft = — Z^) or equivalently 

Yt = fi + Sy^Ft, it can be shown that Ft|Ft_i ~ N[at_t,tFt_i, 1 — whence assuming 

Fq ~ N(0, 1), we are able to evaluate the judicious prior from the decomposition, 

7r(Fo,..., Fy) = 7r(Fo)7r(Fi|Fo) ■ ■ ■ 7r(Fy|Fy_i) 


for all T. 

Retaining the same parameter interpretation as above, the spatiotemporal equivalent of the 
Poisson geostatistical model would then be, 

Zi^t ~ Poisson(Rj^t). 
logRi,t = Xi^t/3+ Yg[i^t] + Ui^t, 

where for example denotes the ith observed count at time t and Fg[i,i] is the value of the 
process {Ft} at the centroid of the spatiotemporal grid containing this observation. For hxed 
output grid size, we retain an 0{n) increase with the number of observations and 0{m logm) 
for increasing spatial resolution; increasing the grid resolution in the temporal domain is at 
a cost of 0{T) 
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For users with access to multi-core machines, there is a further MCMC speed-up available, 
not discussed in the main body of this article. The idea is to use multiple cores to propose 
new sets of candidate parameters simultaneously, then if the hrst and second (/3*, a;*, r;*, F*) 
are rejected, and the third accepted, say, the MCMC can move using the third proposal. 
No correction to the acceptance probability is required, because we are not conditioning on 


the rejected points as in Green and Mira (2001) for example. This method works because 
we are targeting a particular acceptance rate, 0.574 in this case, and it is straightforward 
to compute the efficiency gain in using multiple cores in this way, for the overall acceptance 
rate assuming each ‘chance of success’ is 0.574 independently on an /c-core machine follows 
an upper tail cumulative probability from a Binomial(fc, 0.574) probability density function. 
Thus on an 8-core machine, one can increase the acceptance rate to 99.9% with an associated 
relatively small increase in the cost of managing the 8 simultaneous proposals. The relative 
beneht for MCMC schemes targeting lower acceptance rates, such as 0.234, are even greater; 
on an 8 core machine, this is upped to 88.1%. So for the examples in this article, it would 
be relatively straightforward to reduce the computational time by approximately 40%, for 
both the standard and Fourier-based methods. 
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A MCMC Plots 
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Figure 5: Plot showing the log of the posterior over the retained iterations. The initial 
value of the log posterior is also illustrated to show that the chain left the transient phase 
and appears to have found a mode. 
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Figure 6: Plot showing the prior (red line) and posterior (histogram) for each parameter. 
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